Sentiment Analysis of Twitter Feeds to Find Your Favorite Airline

CS109 Final Project: Nicholas Ruta, Ayin Mokrivala, and Anna Whitney

Overview and Motivation

An average of 6,000 tweets are posted to Twitter every second. We think this makes Twitter a valuable source of text for sentiment analysis: posts are mostly public and available at scale, and the frequent use of hashtags makes it easier to draw conclusions about specific topics.

Sentiment analysis on airlines intrigues us because the industry is heavily price-driven. Ticket prices from different airlines are often in a similar range, which puts the emphasis on the quality of the travel experience. Yet a customer's impression of an airline is usually based only on personal experience or general news, so when prices are alike, customers want to know which airline has the better reputation.

We performed sentiment analysis on tweets collected from the Twitter Streaming API in order to find the most preferred airline. We chose a 'top ten' selection of airlines based on two factors: first, an excellent fivethirtyeight.com article by Nate Silver and his team describing their recent 'best and worst airlines' analysis; second, our initial look at the data collection process, which verified that a sufficient amount of data is generated for each airline. The ten airlines we ended up with were United, Alaska Air, Frontier, Hawaiian Air, Virgin America, Southwest Air, Delta, JetBlue, Spirit Airlines, and American Air.

We created a Python script to collect raw data from the Twitter Streaming API:

In [3]:
#MUST ADD YOUR TWITTER DEVELOPER CREDENTIALS FIRST
#you can rename this cell as get_twitter.py and run from the command line using this command - 
# python get_twitter.py > twitter_data.txt

#Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Set the twitter developer credentials to access the Twitter Streaming API
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""

#This is a basic listener that prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print data
        return True

    def on_error(self, status):
        print status

        
#THIS SECTION IS COMMENTED OUT SINCE THE DEVELOPER CREDENTIALS ARE NOT IN PLACE ABOVE
# if __name__ == '__main__':

    #This handles Twitter authentication and the connection to Twitter Streaming API
#     l = StdOutListener()
#     auth = OAuthHandler(consumer_key, consumer_secret)
#     auth.set_access_token(access_token, access_token_secret)
#     stream = Stream(auth, l)

    #Use the filter to capture data from the stream by keywords -
#     stream.filter(track=['@united', '@AlaskaAir', '@FlyFrontier', '@HawaiianAir', '@VirginAmerica','@SouthwestAir','@Delta','@JetBlue','@SpiritAirlines','@AmericanAir','#united', '#AlaskaAir', '#FlyFrontier', '#HawaiianAir', '#VirginAmerica','#SouthwestAir','#Delta','#Jetblue','#SpiritAirlines','#AmericanAir'], async=True)

We combined all of the .txt files of raw data created by the above script into one clean, final .json file:

In [4]:
# #Uncomment if you want to combine .txt files from the above twitter streaming api process. 

# #take in the combined/entire dataset of json rows 
# #remove blank lines and only write the row to the 'cleanfile.json' file if it is valid JSON

# import json
# import fileinput
# import glob

# a function to verify that a row in the raw data file is valid JSON. We noticed that the Twitter Streaming API 
# did occasionally return data in error and had to take this step to clean the dataset
# def is_json(myjson):
#   try:
#     json_object = json.loads(myjson)
#   except ValueError, e:
#     return False
#   return True

# the glob library makes it easy to grab all of the text files we placed in a tweets folder
# these were the entire collection of raw data we brought in from the Twitter Streaming API over the Nov-Dec 2015 
# timeframe. 
# file_list = glob.glob("tweets/*.txt")
# combined_file_name = 'combined_result.json'


# with open(combined_file_name, 'w') as file:
#     input_lines = fileinput.input(file_list)
#     file.writelines(input_lines)
    

# f = open('clean_final_file.json','w')
# for line in open(combined_file_name):
#   line = line.rstrip()
#   if line != '':
#         if is_json(line) is True:      
#             f.write(line + "\n") # python will convert \n to os.linesep
# f.close() # you can omit in most cases as the destructor will call it

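As an aside, the `except ValueError, e` form in the commented-out cell is Python 2-only syntax. A self-contained, Python 3-compatible sketch of the same `is_json` validity check:

```python
import json

def is_json(line):
    """Return True only if `line` parses as valid JSON."""
    try:
        json.loads(line)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
    return True

print(is_json('{"text": "hi"}'))  # True
print(is_json('not json'))        # False
```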
Our project focuses on analyzing the text of over 100,000 tweets about airlines, using sentiment analysis and LDA to determine which airline receives the most positive or negative attention on Twitter, and which topics people are happy or unhappy about with regard to each airline.

We start by importing the modules we need.

In [5]:
import json
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

The Twitter Streaming API returns JSON objects. We took the resulting JSON and used the Python libraries json and pandas to create a 'raw_data' list for our text processing and visualizations. Our data can be downloaded from Dropbox, and the code below assumes that the JSON file containing the data is placed in the directory above the one containing this notebook.

In [6]:
#path to the file containing the 'raw data' from the twitter streaming api
raw_data_path = '../tweets_all@#_103653.json'

#create a list to hold the tweets
raw_data = []

#open the raw data file for reading
tweets_file = open(raw_data_path, "r")

#append to the tweets list from the raw data file
for line in tweets_file:
    try:
        tweet = json.loads(line)
        raw_data.append(tweet)
    except ValueError:
        #skip lines that are not valid JSON
        continue
print "Number of raw data rows - ", len(raw_data)
print "\n"
print "Here is what a JSON row in the raw data from the Twitter Streaming API looks like - "
print "\n"
print raw_data[0]
Number of raw data rows -  103653


Here is what a JSON row in the raw data from the Twitter Streaming API looks like - 


{u'contributors': None, u'truncated': False, u'text': u'#United States Football Tickets\xa0News https://t.co/82wIUcjYt6 https://t.co/cyGk7msfUQ', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 671312538764599296, u'favorite_count': 0, u'source': u'<a href="http://publicize.wp.com/" rel="nofollow">WordPress.com</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1448888356282', u'entities': {u'user_mentions': [], u'symbols': [], u'hashtags': [{u'indices': [0, 7], u'text': u'United'}], u'urls': [{u'url': u'https://t.co/82wIUcjYt6', u'indices': [37, 60], u'expanded_url': u'http://buyfootballtickets.xyz/index.php/2015/11/30/united-states-football-tickets-news-37/', u'display_url': u'buyfootballtickets.xyz/index.php/2015\u2026'}], u'media': [{u'expanded_url': u'http://twitter.com/rvtkq_t/status/671312538764599296/photo/1', u'display_url': u'pic.twitter.com/cyGk7msfUQ', u'url': u'https://t.co/cyGk7msfUQ', u'media_url_https': u'https://pbs.twimg.com/media/CVD7Mj9UAAEGQvd.jpg', u'id_str': u'671312537707610113', u'sizes': {u'large': {u'h': 252, u'resize': u'fit', u'w': 500}, u'small': {u'h': 171, u'resize': u'fit', u'w': 340}, u'medium': {u'h': 252, u'resize': u'fit', u'w': 500}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [61, 84], u'type': u'photo', u'id': 671312537707610113, u'media_url': u'http://pbs.twimg.com/media/CVD7Mj9UAAEGQvd.jpg'}]}, u'in_reply_to_screen_name': None, u'id_str': u'671312538764599296', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 3776628253, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/650503531057692672/kiexw52X_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'followers_count': 48, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'3776628253', 
u'profile_background_color': u'C0DEED', u'listed_count': 4, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': None, u'statuses_count': 4505, u'description': None, u'friends_count': 121, u'location': None, u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/650503531057692672/kiexw52X_normal.jpg', u'following': None, u'geo_enabled': False, u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Jerri T. Johnson', u'lang': u'vi', u'profile_background_tile': False, u'favourites_count': 23, u'screen_name': u'rvtkq_t', u'notifications': None, u'url': None, u'created_at': u'Sun Oct 04 02:51:16 +0000 2015', u'contributors_enabled': False, u'time_zone': None, u'protected': False, u'default_profile': True, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Mon Nov 30 12:59:16 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'http://twitter.com/rvtkq_t/status/671312538764599296/photo/1', u'display_url': u'pic.twitter.com/cyGk7msfUQ', u'url': u'https://t.co/cyGk7msfUQ', u'media_url_https': u'https://pbs.twimg.com/media/CVD7Mj9UAAEGQvd.jpg', u'id_str': u'671312537707610113', u'sizes': {u'large': {u'h': 252, u'resize': u'fit', u'w': 500}, u'small': {u'h': 171, u'resize': u'fit', u'w': 340}, u'medium': {u'h': 252, u'resize': u'fit', u'w': 500}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [61, 84], u'type': u'photo', u'id': 671312537707610113, u'media_url': u'http://pbs.twimg.com/media/CVD7Mj9UAAEGQvd.jpg'}]}}
In [7]:
tweets = pd.DataFrame()

We took the raw data from the Twitter Streaming API and placed it in a pandas DataFrame. The Twitter API provides many fields, so we used Python's map function with a lambda to set a DataFrame column for each field of potential interest. To keep the DataFrame clean, we set the value to 'None' wherever the Twitter Streaming API returned no value:

In [8]:
#removed u'possibly_sensitive', at the moment since not all rows have it
#it goes between place and retweet_count in the below list
twitter_fields = [u'contributors', u'coordinates', u'created_at', u'entities', 
                  u'favorite_count', u'favorited', u'filter_level', u'geo', u'id',
                  u'id_str', u'in_reply_to_screen_name', u'in_reply_to_status_id', 
                  u'in_reply_to_status_id_str', u'in_reply_to_user_id', u'in_reply_to_user_id_str', 
                  u'lang', u'place',   u'retweet_count', u'retweeted', u'source', 
                  u'text', u'timestamp_ms', u'truncated', u'user']

#set the columns in the dataframe to match the json fields of the twitter streaming api
for t in twitter_fields:
    tweets[t] = map(lambda tweet: tweet[t] if tweet[t] else 'None', raw_data)
tweets['followers_count'] = map(lambda tweet: tweet['user']['followers_count'] if tweet['user'] != None else None, raw_data)
tweets['country'] = map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, raw_data)
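One portability note: in Python 3, `map` returns an iterator rather than a list, so the column assignments above would need `list(...)` wrappers, and `dict.get` avoids a `KeyError` for fields that only some rows carry (such as `possibly_sensitive`, which we dropped for that reason). A minimal sketch on made-up toy rows shaped like the API objects:

```python
# toy rows shaped like the Twitter Streaming API objects used above
raw_data = [
    {'text': 'hi', 'user': {'followers_count': 48}, 'place': None},
    {'text': 'yo', 'user': {'followers_count': 7},
     'place': {'country': 'United States'}},
]

# wrapping map(...) in list(...) works under both Python 2 and 3
followers = list(map(lambda t: t['user']['followers_count']
                     if t['user'] is not None else None, raw_data))
countries = list(map(lambda t: t['place']['country']
                     if t['place'] is not None else None, raw_data))
# dict.get handles fields that only some rows carry
sensitive = [t.get('possibly_sensitive') for t in raw_data]

print(followers)  # [48, 7]
print(countries)  # [None, 'United States']
```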

We will use many of these fields to do our analysis, draw conclusions, and build visualizations. For example, we are interested in how positive the sentiment of tweets from users with the highest followers_count is. We wondered whether the top Twitter users are paid to be positive, and we wanted to see if the data reflects this possibility:

In [9]:
print "We can see how many people are following the user of each tweet in the dataset - "
print tweets['followers_count'].head()

print "\n"

print "And it will be important to verify that the majority of the data is in American English since that is what we will be basing our sentiment analysis on -"
print tweets['lang'].head()
We can see how many people are following the user of each tweet in the dataset - 
0      48
1    2234
2      16
3      35
4    1672
Name: followers_count, dtype: int64


And it will be important to verify that the majority of the data is in American English since that is what we will be basing our sentiment analysis on -
0    en
1    en
2    en
3    en
4    en
Name: lang, dtype: object

Now that we have collected the raw data and placed it in a pandas dataframe, let's take a look at some of the tweets text specifically:

In [10]:
#View the first 5 tweets of the dataset
pd.set_option('max_colwidth', 200)
tweets['text'].head(5)
Out[10]:
0                      #United States Football Tickets News https://t.co/82wIUcjYt6 https://t.co/cyGk7msfUQ
1    In fact I have rarely in my entire life (if ever?) felt more bamboozled by an airline. @SpiritAirlines
2                      #United States Football Tickets News https://t.co/USYm8mOueG https://t.co/35eZEfycvB
3                      #United States Football Tickets News https://t.co/SFVIa6D9dg https://t.co/owpvznBCrA
4                                       Complimentary #Citrix CTP WiFi on @Delta flight, always a pleasure!
Name: text, dtype: object

First, we install the plotly library for visualization:

In [11]:
#Run these two commands in Terminal to initialize plotly online
#you may need to make plotly account before doing this
#Use this link to setup plotly: https://plot.ly/python/user-guide/

#pip install plotly
#python -c "import plotly; plotly.tools.set_credentials_file(username='nruta', api_key='mwv4tll3ev')"
#pip install cufflinks
In [1]:
#run these using your username and key
import plotly.tools as tls
tls.set_credentials_file(username='ayinmv', api_key='rq66z3hqx8')

import plotly.plotly as py
from plotly.graph_objs import *
In [2]:
!pip install plotly
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

print __version__ # requires version >= 1.9.0
#import cufflink package
import cufflinks as cf
print cf.__version__

init_notebook_mode() # run at the start of every ipython notebook to use plotly.offline
                     # this injects the plotly.js source files into the notebook
    
py.sign_in(username='ayinmv', api_key='rq66z3hqx8')
Requirement already satisfied (use --upgrade to upgrade): plotly in /anaconda2/anaconda/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): pytz in /anaconda2/anaconda/lib/python2.7/site-packages (from plotly)
Requirement already satisfied (use --upgrade to upgrade): requests in /anaconda2/anaconda/lib/python2.7/site-packages (from plotly)
Requirement already satisfied (use --upgrade to upgrade): six in /anaconda2/anaconda/lib/python2.7/site-packages (from plotly)
1.9.2
0.7.1

It is important to verify that the majority of the tweets are in American English, since the sentiment analysis depends on it. We are using a word list that weights key nouns and adjectives based on the sentiment associated with them, and it assumes American English input.

We used a histogram to verify that the top language is English:

In [3]:
#histogram of all the languages we are detecting in our tweets
tweets_by_lang = tweets['lang'].value_counts()
tweets_by_lang.iplot(kind='bar', yTitle='Languages', title='Languages')
tls.embed('https://plot.ly/~ayinmv/97')
Out[3]:

and another histogram to verify that the tweets are mostly from the USA:

In [4]:
#histogram of all the countries represented in our tweets
tweets_by_country = tweets['country'].value_counts()
tweets_by_country.iplot(kind='bar', yTitle='Countries', title='Countries')
tls.embed('https://plot.ly/~ayinmv/138')
Out[4]:

We wanted to see how many tweets mention each of the ten airlines we selected for the project. First, we create a function that finds a word in the tweet text column of the pandas dataframe:

In [19]:
import re
#create a function to find the word in the tweet text field
def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
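A quick sanity check of the helper (rewritten compactly with `print()` so it also runs under Python 3). Note that `re.search` interprets its first argument as a regular expression, so keywords containing metacharacters would need `re.escape`; the airline names here are plain alphanumerics, so this is not an issue:

```python
import re

def word_in_text(word, text):
    # case-insensitive match; re.search treats `word` as a regex
    # pattern, so non-alphanumeric keywords would need re.escape
    return re.search(word.lower(), text.lower()) is not None

print(word_in_text('delta', 'Complimentary WiFi on @Delta flight'))  # True
print(word_in_text('united', 'Flying JetBlue today'))                # False
```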

We then used that function, word_in_text, to create a new column for each airline so we could compute the total count for each:

In [20]:
#get the words in the text
airlines = ['southwest', 'delta', 'jetblue', 'united', 'flyfrontier', 'hawaiianair', 'virginamerica', 'alaskaair','spiritairlines', 'AmericanAir']

#create columns for each airline and set a boolean value to use for the below visualization of tweet count 
#for each airline
for a in airlines:
    if (tweets['text'].apply(lambda tweet: word_in_text(a, tweet))).count() > 0:
        tweets[a] = tweets['text'].apply(lambda tweet: word_in_text(a, tweet))

We can get the total count for a particular airline by counting the rows of that airline's column that have a True value. For example, here is the total count for Southwest Airlines:

In [21]:
print len(tweets.loc[tweets['southwest'] == True])
11204
In [22]:
#view the amount of tweets per airline
# tweets_by_airlines = [len(tweets.loc[tweets['southwest'] == True]), 
#  len(tweets.loc[tweets['delta'] == True]), len(tweets.loc[tweets['jetblue'] == True]),
#          len(tweets.loc[tweets['united'] == True]), len(tweets.loc[tweets['flyfrontier'] == True]),
#                      len(tweets.loc[tweets['hawaiianair'] == True]),len(tweets.loc[tweets['virginamerica'] == True]),
#                      len(tweets.loc[tweets['alaskaair'] == True]),len(tweets.loc[tweets['spiritairlines'] == True]),
#                      len(tweets.loc[tweets['AmericanAir'] == True])]

# airlines = ['southwest', 'delta', 'jetblue', 'united', 'flyfrontier', 'hawaiianair', 'virginamerica', 'alaskaair','spiritairlines', 'AmericanAir']

# x_pos = list(range(len(airlines)))
# width = 0.3
# fig, ax = plt.subplots(figsize=(16,8))
# plt.bar(x_pos, tweets_by_airlines, width, alpha=1, color='g')
# # Setting axis labels and ticks
# ax.set_ylabel('Number of tweets', fontsize=15)
# ax.set_title('Ranking: Airlines (Raw data)', fontsize=10, fontweight='bold')
# ax.set_xticks([p + 0.9 * width for p in x_pos])
# ax.set_xticklabels(airlines)
# plt.grid()

We will need a new 'airline' column in the dataframe. We created it for our Spark text processing and LDA topic modeling, in addition to the visualizations we will create:

In [23]:
#create a dataframe called 'processed_data' with extra column 'airline'

#a function to set the airline name where the column value is True
#(if multiple airline columns are True, the last match wins)
def setAirlineName(row):
    x = None
    for a in airlines:
        if row[a] is True:
            x = a
    return x

# set a column in the tweets dataframe for each airline name with a boolean value
for a in airlines:
    if (tweets['text'].apply(lambda tweet: word_in_text(a, tweet))).count() > 0:
        tweets[a] = tweets['text'].apply(lambda tweet: word_in_text(a, tweet))

#add the airline column with the airline name as the value 
tweets['airline'] = tweets.apply(lambda row: setAirlineName(row), axis=1)  

#remove the temp. columns for each airline
for a in airlines:
    tweets = tweets.drop(a, 1)

#filter to just English-language tweets, since all our language processing is English-specific
processed_data = tweets[tweets['lang'] == 'en']

#processed_data['airline'] values are - 'southwest', 'delta', 'jetblue', 'united', 'flyfrontier', 
# 'hawaiianair', 'virginamerica', 'alaskaair','spiritairlines', 'AmericanAir'
print 'The Processed Data File contains', len(processed_data), 'tweets.'
The Processed Data File contains 93737 tweets.

This new dataframe contains the 'airline' column we need:

In [24]:
processed_data['airline'].head()
Out[24]:
0            united
1    spiritairlines
2            united
3            united
4             delta
Name: airline, dtype: object

At one point, we thought it would be necessary to have a JSON file created from this new dataframe. We created a script to do that but eventually decided to use the dataframe directly in the Spark text processing application:

In [25]:
#create a json file from the processed_data pandas dataframe
#it has the extra airline field that will be used for spark processing

#This is commented out since we are using the processed_data dataframe for the spark processing
# with open('processed_data.json', 'w') as outfile:  
#     for index, row in processed_data.iterrows():
#         outfile.write(row.to_json())
#         outfile.write('\n')

Pass the data with airlines to Spark for text processing

In [26]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
from sklearn.feature_extraction import text
from gensim import corpora 
In [27]:
import findspark
findspark.init()
print findspark.find()
/usr/local/Cellar/apache-spark/1.5.2/libexec/
In [28]:
import pyspark
In [29]:
# adapted from HW5
def get_parts(thetext, punc='.,;:!?()[]{}`''\"@#$^&*+-|=~_'):
    # generate stopwords list & regexes for 2+ periods or 2+ dashes
    stop = text.ENGLISH_STOP_WORDS
    regex1=re.compile(r"\.{2,}")
    regex2=re.compile(r"\-{2,}")
    thetext=re.sub(regex1, ' ', thetext)
    thetext=re.sub(regex2, ' ', thetext)
    punctuation = list(punc)
    nouns=[]
    descriptives=[]
    for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
        nouns.append([])
        descriptives.append([])
        for token in sentence:
            if len(token[4]) >0:
                if token[1] in ['JJ', 'JJR', 'JJS']:
                    if token[4] in stop or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    descriptives[i].append(token[4])
                elif token[1] in ['NN', 'NNS']:
                    if token[4] in stop or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
                        continue
                    nouns[i].append(token[4])
    out=zip(nouns, descriptives)
    nouns2=[]
    descriptives2=[]
    for n,d in out:
        if len(n)!=0 and len(d)!=0:
            nouns2.append(n)
            descriptives2.append(d)
    return nouns2, descriptives2
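Stripped of the `pattern` dependency, the filtering inside get_parts amounts to keeping lemmas tagged as adjectives (`JJ`/`JJR`/`JJS`) or common nouns (`NN`/`NNS`) that are not stopwords, not punctuation-adjacent, and longer than one character. A self-contained sketch on pre-tagged `(lemma, tag)` pairs, where the stopword set is a small stand-in for sklearn's list and the sentence is made up (the tagging itself is what `parse` provides):

```python
PUNC = set('.,;:!?()[]{}@#$^&*+-|=~_"\'`')
STOP = {'the', 'a', 'an', 'is', 'on', 'my'}  # stand-in for sklearn's list

def split_parts(tagged_sentence):
    """Separate (lemma, tag) pairs into nouns and descriptives."""
    nouns, descriptives = [], []
    for lemma, tag in tagged_sentence:
        # drop stopwords, single characters, and punctuation-adjacent tokens
        if (lemma in STOP or len(lemma) == 1
                or lemma[0] in PUNC or lemma[-1] in PUNC):
            continue
        if tag in ('JJ', 'JJR', 'JJS'):
            descriptives.append(lemma)
        elif tag in ('NN', 'NNS'):
            nouns.append(lemma)
    return nouns, descriptives

sentence = [('the', 'DT'), ('cramped', 'JJ'), ('seat', 'NN'),
            ('on', 'IN'), ('long', 'JJ'), ('flights', 'NNS')]
print(split_parts(sentence))  # (['seat', 'flights'], ['cramped', 'long'])
```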
In [30]:
# initialize Spark context
conf = pyspark.SparkConf().setAppName("Twitter_Airline").setMaster("local[*]")
sc = pyspark.SparkContext(conf=conf)

We read all the tweets from the dataframe into Spark, assigning each tweet a unique ID so we can track it through our sentiment analysis and LDA topic modeling.

In [31]:
# read tweets & associated airlines into Spark
tweets_text = sc.parallelize([(row['airline'], row['text']) for index, row in processed_data.iterrows()]).zipWithIndex().map(lambda ((air, txt), idx): ((idx, air), txt))
tweets_text.take(5)
Out[31]:
[((0, 'united'),
  u'#United States Football Tickets\xa0News https://t.co/82wIUcjYt6 https://t.co/cyGk7msfUQ'),
 ((1, 'spiritairlines'),
  u'In fact I have rarely in my entire life (if ever?) felt more bamboozled by an airline. @SpiritAirlines'),
 ((2, 'united'),
  u'#United States Football Tickets\xa0News https://t.co/USYm8mOueG https://t.co/35eZEfycvB'),
 ((3, 'united'),
  u'#United States Football Tickets\xa0News https://t.co/SFVIa6D9dg https://t.co/owpvznBCrA'),
 ((4, 'delta'),
  u'Complimentary #Citrix CTP WiFi on @Delta flight, always a pleasure!')]

Sentiment of a tweet based on log probabilities from a word list

Function to read the word list file:

In [32]:
import numpy as np
# read the word list
def readSentimentList(file_name):
    ifile = open(file_name, 'r')
    happy_log_probs = {}
    sad_log_probs = {}
    ifile.readline() #Ignore title row
    # splitting the csv
    for line in ifile:
        tokens = line[:-1].split(',')
        happy_log_probs[tokens[0]] = float(tokens[1])
        sad_log_probs[tokens[0]] = float(tokens[2])

    return happy_log_probs, sad_log_probs
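The word list is assumed to be a CSV with a header row followed by `word,happy_log_prob,sad_log_prob` rows, matching the parse above (the actual file is not reproduced here, and the numbers below are illustrative only). A sketch of the same parse on in-memory data:

```python
# hypothetical three-row word list in the assumed CSV layout:
# header row, then word,happy_log_prob,sad_log_prob
sample = """word,happy_log_prob,sad_log_prob
great,-6.2,-8.9
hate,-9.1,-6.5
flight,-7.0,-7.1
"""

happy_log_probs = {}
sad_log_probs = {}
for line in sample.strip().split('\n')[1:]:  # [1:] skips the header
    word, happy, sad = line.split(',')
    happy_log_probs[word] = float(happy)
    sad_log_probs[word] = float(sad)

print(happy_log_probs['great'])  # -6.2
print(sad_log_probs['hate'])     # -6.5
```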

Classifying sentiment with the Naive Bayes rule:

In [33]:
def classifySentiment(words, happy_log_probs, sad_log_probs):
    # get the log-probability of each word under each sentiment
    happy_probs = [happy_log_probs[word] for word in words if word in happy_log_probs]
    sad_probs = [sad_log_probs[word] for word in words if word in sad_log_probs]

    # sum all the log-probabilities for each sentiment to get a log-probability for the whole tweet
    tweet_happy_log_prob = np.sum(happy_probs)
    tweet_sad_log_prob = np.sum(sad_probs)

    # calculate the probability of the tweet belonging to each sentiment
    prob_happy = np.reciprocal(np.exp(tweet_sad_log_prob - tweet_happy_log_prob) + 1)
    prob_sad = 1 - prob_happy

    return prob_happy, prob_sad
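To see why the last step works: summing the per-word log probabilities gives the tweet's log-likelihood under each class, and with equal class priors Bayes' rule reduces to a logistic form, which is exactly the `np.reciprocal(np.exp(...) + 1)` expression in classifySentiment. A worked example with illustrative numbers (not from the real word list):

```python
import math

# illustrative summed log-likelihoods for one tweet
log_happy = -15.0  # sum of happy log probs over the tweet's words
log_sad = -13.0    # sum of sad log probs over the tweet's words

# with equal class priors, Bayes' rule reduces to:
#   P(happy | words) = 1 / (1 + exp(log_sad - log_happy))
prob_happy = 1.0 / (1.0 + math.exp(log_sad - log_happy))
prob_sad = 1.0 - prob_happy

print(round(prob_happy, 4))  # 0.1192
```

Since log_sad exceeds log_happy by 2 here, the tweet is classified as mostly sad, as expected.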

Load the word list:

In [34]:
# load list of words and log probs
happy_log_probs, sad_log_probs = readSentimentList('wordlist.csv')

Reading in a sample tweet:

In [35]:
# read tweet
tweet1 = ['I', 'hate', 'southwest']

# calculate the probability
tweet1_happy_prob, tweet1_sad_prob = classifySentiment(tweet1, happy_log_probs, sad_log_probs)

print tweet1 
print "happy probability: " , tweet1_happy_prob 
print "sad probability:", tweet1_sad_prob
['I', 'hate', 'southwest']
happy probability:  0.280105168408
sad probability: 0.719894831592
In [36]:
# get words out for sentiment analysis
puncs = '.,;:!?()[]{}`''\"@#$^&*+-|=~_'
sentiment_words = tweets_text.mapValues(lambda t: t.strip(puncs).split())

# classify sentiment of tweet
tweets_probs = sentiment_words.mapValues(lambda ws: classifySentiment(ws, happy_log_probs, sad_log_probs))
happy_probs = tweets_probs.mapValues(lambda (hprob, sprob): (hprob, 1))
sad_probs = tweets_probs.mapValues(lambda (hprob, sprob): (sprob, 1))

Visualization

Create a dataframe to use for visualization.

In [37]:
#collect the happy probabilities from the above mapValues call
listOfHappyProbs = happy_probs.collect()

#create an empty list to contain just the probability portion of happy_probs
HappyProbsList = []
for x in range(len(listOfHappyProbs)):
    HappyProbsList.append(listOfHappyProbs[x][1][0])

#create the following dataframe containing columns of data for visualization
# text, airline, positive, prob, created_at, favorite_count, retweet_count, followers_count
df = pd.DataFrame()
df['created_at'] = processed_data['created_at']
df['favorite_count'] = processed_data['favorite_count']
df['retweet_count'] = processed_data['retweet_count']
df['followers_count'] = processed_data['followers_count']
# df_data_visualization['text'] = processed_data['text']
df['airline'] = processed_data['airline']
df['prob'] = HappyProbsList

#a function to set a 1 if the happy probability is greater than .5 otherwise set it to 0
#we will use this for the visualization aspect
def setPositiveValues(row):
    if row['prob'] > 0.5:
        return 1
    else:
        return 0

#run the function on the prob column to create the positive column to determine if a tweet is positive or negative
#from the happy probability
df['positive'] = df.apply(setPositiveValues, axis=1)

#create the text column, remove the newlines from the text to resolve an issue with row creation in the .to_csv call
df['text'] = processed_data['text'].str.replace('\n',"")

#create a function to set the airline names for the visualizations. We wanted the names to look clean on the graphs:
def setChartNames(row):
    # use == (not 'is') to compare string values
    if row == 'united':
        return 'United'
    elif row == 'southwest':
        return 'Southwest'
    elif row == 'delta':
        return 'Delta'
    elif row == 'jetblue':
        return 'JetBlue'
    elif row == 'flyfrontier':
        return 'Frontier'
    elif row == 'hawaiianair':
        return 'Hawaiian'
    elif row == 'virginamerica':
        return 'Virgin'
    elif row == 'alaskaair':
        return 'Alaska'
    elif row == 'spiritairlines':
        return 'Spirit'
    elif row == 'AmericanAir':
        return 'American'
   
#set the airline names using the above function    
df['airline'] = df['airline'].apply(setChartNames)

#set the created_at column to a pandas datetime column
df['created_at'] = pd.to_datetime(df['created_at'], dayfirst=True )

#set the format of the dates to look clean on the visualizations to come
df['created_at'] = df['created_at'].map(lambda x: x.strftime('%m-%d-%Y'))

#create a .csv file to use for visualization 
df.to_csv('output.csv', encoding='utf-8', index=False)

Here we start with the visualization. But first, we need to remove the undecided tweets (those with probability exactly 0.5).

In [38]:
#getting rid of undecided tweets
df = df[(df.prob != .5)]

Some of the positive tweets:

In [39]:
df[(df.positive == 1)].head(5)
Out[39]:
created_at favorite_count retweet_count followers_count airline prob positive text
1 11-30-2015 None None 2234 Spirit 0.540553 1 In fact I have rarely in my entire life (if ever?) felt more bamboozled by an airline. @SpiritAirlines
4 11-30-2015 None None 1672 Delta 0.966443 1 Complimentary #Citrix CTP WiFi on @Delta flight, always a pleasure!
5 11-30-2015 None None 189 American 0.873072 1 @judsonabts @AmericanAir the thing about it is they were my carry ons...
10 11-30-2015 None None 16720 Virgin 0.591014 1 .@United responds to @VirginAmerica entering #Denver - San Francisco market https://t.co/P47FlIPWXy https://t.co/et2OeZSZCe
11 11-30-2015 None None 16 American 0.745928 1 @AmericanAir another flight with American and again a super cramped seat. It's not an airplane it's a Sardine Can it seems like it.

Some of the negative tweets:

In [40]:
df[(df.positive == 0)].head(5)
Out[40]:
created_at favorite_count retweet_count followers_count airline prob positive text
16 11-30-2015 None None 469 Delta 0.399293 0 At this point it looks like I'll miss my connecting flight. #delta
18 11-30-2015 None None 332 American 0.387292 0 @AmericanAir you guys fucking blow.
21 11-30-2015 None None 23 Delta 0.267162 0 RT @ciarahanna20: Get me in the air for freaks sake @Delta
34 11-30-2015 None None 1040 Southwest 0.347796 0 Planes need LUV too! @SouthwestAir @SouthwestTheMag #crewlife #avgeek https://t.co/HOmqld7zkH
37 11-30-2015 None None 20 Delta 0.062909 0 RT @BillyChipp: We have now deplaned the @Delta aircraft. Settling in for a long delay and still missing my wife! 😢 #flightdelay #Delta

We can aggregate the happy and sad probabilities of individual tweets about each airline into an average probability that users are happy or sad about that airline:

In [41]:
airline_happy_probs = happy_probs.map(lambda ((idx, air), probs): (air, probs)).reduceByKey(lambda (p1,num1),(p2,num2): ((num1*p1 + num2*p2)/(num1 + num2), num1 + num2)).mapValues(lambda (p, n): p) 
airline_sad_probs = sad_probs.map(lambda ((idx, air), probs): (air, probs)).reduceByKey(lambda (p1,num1),(p2,num2): ((num1*p1 + num2*p2)/(num1 + num2), num1 + num2)).mapValues(lambda (p, n): p) 
print "Happy probabilities:", airline_happy_probs.take(5)
print "Sad probabilities:", airline_sad_probs.take(5)
Happy probabilities: [('alaskaair', 0.6999773908606034), (None, 0.66654926800966996), ('hawaiianair', 0.72128013082123421), ('united', 0.6481662436437734), ('spiritairlines', 0.59089167381864605)]
Sad probabilities: [('alaskaair', 0.30002260913939632), (None, 0.33345073199032976), ('hawaiianair', 0.27871986917876562), ('united', 0.35183375635623493), ('spiritairlines', 0.40910832618135362)]
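The `reduceByKey` above merges `(mean_probability, count)` pairs with a weighted mean, which keeps the running average correct no matter which order Spark combines partitions in. The same merge in plain Python, on illustrative `(prob, 1)` pairs like those produced by the mapValues step:

```python
def merge(a, b):
    """Combine two (mean_prob, count) pairs into one weighted mean."""
    p1, n1 = a
    p2, n2 = b
    return ((n1 * p1 + n2 * p2) / (n1 + n2), n1 + n2)

# three tweets about one airline, each contributing (prob, 1)
pairs = [(0.9, 1), (0.6, 1), (0.3, 1)]
total = pairs[0]
for pair in pairs[1:]:
    total = merge(total, pair)

mean_prob, count = total
print(count)      # 3
print(mean_prob)  # 0.6
```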

We can group the dataframe by airline. Here is the number of tweets per airline:

In [5]:
airline_count = df.groupby(['airline']).count().prob.sort(axis=0, ascending=False, inplace=False)
airline_count.iplot(kind='bar', yTitle='Number of Tweets', title='Number of Tweets Per Airline')

tls.embed('https://plot.ly/~ayinmv/112')
Out[5]:

We score the airlines based on the average of their positive-sentiment probabilities:

In [6]:
airline_count = df.groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
airline_count.iplot(kind='bar', yTitle='Average Score', title='Average Score')

tls.embed('https://plot.ly/~ayinmv/63')
Out[6]:

To make the differences easier to see, we subtract the overall average score and plot each airline's deviation on a bar chart:

In [7]:
# Learn about API authentication here: https://plot.ly/pandas/getting-started
# Find your api_key here: https://plot.ly/settings/api
import plotly.plotly as py
import plotly.graph_objs as go

# build this dataframe to set the labels
dfx = pd.DataFrame([airline_count]).transpose().reset_index(level=0)

data = [
    go.Bar(
        x=dfx['airline'], # assign x as the dataframe column 'x'
        y=airline_count - np.mean(airline_count)
    )
]

# IPython notebook
df.iplot(data, yTitle='Average Score', title='Score Difference from Average Score')

tls.embed('https://plot.ly/~ayinmv/123')
Out[7]:
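The subtraction above centers each airline's score on the overall mean, so the deviations sum to (numerically) zero. A small sketch with hypothetical scores:

```python
import pandas as pd

# Hypothetical average happy-probabilities per airline
scores = pd.Series({'Virgin': 0.76, 'Delta': 0.72, 'Spirit': 0.58})
deviation = scores - scores.mean()
# Above-average airlines get positive bars, below-average ones negative
```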

Here we explore highlighting the most influential tweets. For example, we can sort tweets by the author's follower count:

In [46]:
pd.set_option('max_colwidth', 200)
#showing a subset of dataframe
df.sort(columns='followers_count', ascending=[False])[['followers_count','text', 'prob', 'created_at']].head(5)
Out[46]:
followers_count text prob created_at
89254 9561279 @beauflynn @VirginAmerica Yea baby! Fav airline to fly outta FLL. Safe travels bro great seeing you 0.999046 11-29-2015
28744 4546595 A couple teammates took in a different form of elite training during the off-day in Atlanta at @Delta​ HQ: https://t.co/pDpWK0aeKc 0.882898 12-03-2015
32985 3313970 @AmericanAir thank you for the concierge key this year! hope i get renewed for 2016! 0.916204 12-04-2015
32942 3313940 me &amp; @R3HAB waited until our row fell asleep then made a lil banger here on @americanair studios. 0.310402 12-04-2015
29952 3068510 Had the best time spreading cheer today @delta's Holiday in the Hangar event! Check out this winter wonderland! #DeltaLAX #DeltaGreaterGood 0.704560 12-04-2015

We can take a closer look at the data with quantile plots, using the output of the groupby-airline aggregation.

In [12]:
# quantile plot for each airline
dfgroupby = pd.read_csv("outputgroupby.csv")
dfgroupby.iplot(kind='box', title='Score Quantiles Graph')
tls.embed('https://plot.ly/~ayinmv/147')
Out[12]:

Here we check whether users with the highest follower counts post unusually positive tweets. We suspect that many of the most popular users have an incentive to post positively.

In [49]:
# build a dataframe of average scores bucketed by follower-count threshold
dffollow = pd.DataFrame()
dffollow['100']=df[(df.followers_count > 100)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['1K']=df[(df.followers_count > 1000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['10K']=df[(df.followers_count > 10000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['100K']=df[(df.followers_count > 100000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow['1M']=df[(df.followers_count > 1000000)].groupby(['airline']).mean().prob.sort(axis=0, ascending=False, inplace=False)
dffollow
Out[49]:
100 1K 10K 100K 1M
airline
Virgin 0.759421 0.763725 0.751168 0.709281 0.702845
Southwest 0.747498 0.763970 0.738084 0.722091 0.670483
Hawaiian 0.744871 0.747934 0.816992 0.858927 NaN
Alaska 0.738351 0.746514 0.731651 0.725197 NaN
JetBlue 0.738120 0.746744 0.730362 0.724806 0.691332
Frontier 0.716418 0.737914 0.722606 0.465433 NaN
Delta 0.715612 0.720521 0.719754 0.702719 0.767940
United 0.688019 0.731144 0.725874 0.712427 0.496645
American 0.680584 0.699743 0.716007 0.674076 0.581509
Spirit 0.582546 0.634357 0.607038 0.018142 NaN
In [9]:
# quantile plot for each airline
dffollow.iplot(kind='box', title='Score Quantiles')
tls.embed('https://plot.ly/~ayinmv/83')
Out[9]:

We did not see a clear trend supporting our hypothesis about the most popular users. Users with an incentive to post unusually positive tweets may represent only a small share of the population.
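The NaN cells in the table above appear when no tweets about an airline clear a follower threshold. A minimal sketch of the bucketing with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'airline': ['delta', 'delta', 'united', 'united'],
    'followers_count': [50, 20_000, 500, 2_000_000],
    'prob': [0.9, 0.6, 0.8, 0.4],
})

# Average sentiment per airline, restricted to increasingly popular authors
by_thresh = pd.DataFrame({
    label: df[df.followers_count > t].groupby('airline')['prob'].mean()
    for label, t in [('100', 100), ('1K', 1_000), ('1M', 1_000_000)]
})
# delta has no tweet from a >1M-follower account, so its '1M' cell is NaN
```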

LDA on nouns for topic analysis

LDA, or Latent Dirichlet Allocation, is a topic modeling algorithm that, given a corpus of documents, generates a set of topics associated with related words, and then assigns each document a likelihood of belonging to each topic using information about the corpus as well as random seeding. LDA determines how closely related two words are on the basis of their co-occurrence in one or more documents. However, words that do not appear in any documents together can end up associated with the same topic, if they both co-occur with another word. Even longer chains can also form topics – i.e., word A co-occurs with B, which co-occurs with C, which co-occurs with D, which co-occurs with E, and A through E could end up all in the same topic even though A and E neither co-occur nor share a word they co-occur with.

Note also that because LDA is a stochastic model, if you re-run the rest of this notebook, your results may be different from ours. The explanations accompanying our results are specific to a particular run of the notebook and will not describe the results you see if you re-run the notebook. However, the underlying principles should generalize to any run of the notebook.

We separate out the nouns from each tweet to feed into our LDA model, and create a gensim dictionary of all the nouns. We don't want to deal with misspellings or other terms that appear only once or twice in our dataset, so we filter terms that appear too few times out of our dictionary with dictionary.filter_extremes(), and then use dictionary.compactify() to remove gaps in the indices of the dictionary left by removing those terms.

The function dictionary.filter_extremes() also removes any terms that appear in more than half of all documents in the dataset, but these are generally stopwords (i.e., extremely common and uninformative words like "and" or "the"), which we have already removed in a previous step.
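The effect of `filter_extremes()` can be sketched in pure Python (this is the idea, not gensim's implementation; the thresholds here are illustrative):

```python
from collections import Counter

docs = [['flight', 'delay'], ['flight', 'delay'], ['flight', 'snack'],
        ['flight', 'snack'], ['flight', 'zzyzx']]

n_docs = len(docs)
# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for doc in docs for tok in set(doc))
# Keep terms seen in at least 2 documents but in no more than half of them
kept = {tok for tok, df in doc_freq.items() if df >= 2 and df / n_docs <= 0.5}
# 'flight' (too common) and 'zzyzx' (too rare) are dropped
```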

In [37]:
from operator import add
In [38]:
# parse nouns out of tweets
tweets_n_a = tweets_text.mapValues(get_parts)
tweets_nouns = tweets_n_a.mapValues(lambda (n, adj): n)
print tweets_nouns.take(5)
all_nouns = tweets_nouns.flatMapValues(lambda l: l).values().toLocalIterator()

# feed nouns into gensim
dictionary = corpora.Dictionary(all_nouns)
dictionary.filter_extremes()
dictionary.compactify()
[((0, 'united'), []), ((1, 'spiritairlines'), [[u'fact', u'life']]), ((2, 'united'), []), ((3, 'united'), []), ((4, 'delta'), [[u'flight', u'pleasure']])]

We then vectorize the nouns in each tweet (the first map is just to get our input in the form that gensim's doc2bow function wants, while the filter removes tweets that didn't contain any identifiable nouns).

In [39]:
doc_vecs = tweets_nouns.mapValues(lambda n: reduce(add, n, [])).filter(lambda (k, v): v).mapValues(dictionary.doc2bow)
corpus = doc_vecs.values().collect()
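What `doc2bow` produces can be illustrated with a tiny stand-in (hypothetical token IDs; the real mapping lives in the gensim dictionary):

```python
from collections import Counter

token2id = {'flight': 0, 'delay': 1, 'snack': 2}  # hypothetical mini-dictionary

def doc2bow(tokens):
    """Count occurrences of known tokens, returning sorted (id, count) pairs."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())

doc2bow(['flight', 'delay', 'flight', 'unknown'])  # -> [(0, 2), (1, 1)]
```

Tokens outside the dictionary (like `'unknown'` here) are silently dropped, which is why tweets with no recognizable nouns end up empty and get filtered out above.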

We use Latent Dirichlet Allocation (LDA) from gensim to find latent topics within our tweets. LDA maps our corpus of documents (in this case, tweets) from the high-dimensional space of bag-of-words vectors (in which each unique token can be thought of as a "dimension") to a lower-dimensional space of a specific number of topics.

In [43]:
len(dictionary.keys())
Out[43]:
2442
In [40]:
NUM_TOPICS = 100

So in this case, our corpus represented as bag-of-words vectors can be thought of as a 2442-dimensional space, which is very difficult to work with to determine document similarities and differences. Instead, we will represent it as a 100-dimensional space of LDA topics, and in particular we will focus on just fifteen of the "best" topics returned by our LDA model. This will make it much more manageable to come to conclusions about the content of our documents.

In [41]:
from gensim.models.ldamodel import LdaModel
In [45]:
lda = LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=2)

Printing the top topics, we can see the top terms associated with each topic and the "coherence score" of the topic, a measure of how closely related the terms in each topic are.

The coherence score is calculated as: $$C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac {D\left(v^{(t)}_m, v^{(t)}_l\right) + 1} {D\left(v^{(t)}_l\right)}$$ where $V^{(t)} = \left(v^{(t)}_1, ..., v^{(t)}_M\right)$ is a list of the $M$ most probable words for topic $t$, $D(v)$ is the document frequency of the word $v$, and $D(v, v')$ is the co-document frequency of the words $v$ and $v'$, i.e., the number of documents in which both words appear (see the original paper for more details).

The coherence score thus measures roughly how likely it is that the words associated with a given topic are actually conceptually related to each other. The absolute score isn't terribly useful on its own, since the range of this function depends on the size of the corpus, length of documents, etc., but it's a useful way of ranking topics relative to each other on how likely they are to be informative. We thus take the top topics by coherence score to perform further analysis on.
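To make the formula concrete, here is the coherence of a two-word "topic" over a toy corpus (hypothetical data; gensim computes this internally):

```python
import math

def umass_coherence(top_words, docs):
    """C(t) = sum over word pairs (m > l) of log((D(v_m, v_l) + 1) / D(v_l))."""
    def D(*words):
        # Number of documents containing all the given words
        return sum(all(w in doc for w in words) for doc in docs)
    M = len(top_words)
    return sum(
        math.log((D(top_words[m], top_words[l]) + 1) / D(top_words[l]))
        for m in range(1, M) for l in range(m)
    )

docs = [{'flight', 'delay'}, {'flight', 'delay'}, {'gate'}]
score = umass_coherence(['flight', 'delay'], docs)
# D(flight) = 2 and D(flight, delay) = 2, so C = log(3/2)
```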

In [46]:
best_topics = lda.top_topics(corpus)[:15]
for idx, tpc in enumerate(best_topics):
    print "Topic", idx, ":"
    print "    score:", tpc[1]
    print "    terms:", tpc[0][:10]
Topic 0 :
    score: -846.553206691
    terms: [(0.20945266560108269, u'cabin'), (0.16513897482397608, u'snack'), (0.10435986118247431, u'check-in'), (0.058273069930404653, u'bonus'), (0.036724892600328136, u'cart'), (0.034101219840475962, u'sabokitty'), (0.032601709302002159, u'counter'), (0.030515063666352135, u"int'l"), (0.029246208573376644, u'efficiency'), (0.028231300471702785, u'banquet')]
Topic 1 :
    score: -880.20264645
    terms: [(0.13555936181729164, u'instrument'), (0.12101637566853622, u'process'), (0.099360134130023675, u'bore'), (0.071013700727874274, u'town'), (0.060400839670117695, u'pound'), (0.047051495180599555, u'clueless'), (0.044038118684497161, u'time.it'), (0.041764711934958648, u'sticker'), (0.035259050462623648, u'insurance'), (0.030745274077872302, u'shite')]
Topic 2 :
    score: -886.974511585
    terms: [(0.2236348301667356, u'school'), (0.11473742471433415, u'priority'), (0.10585691288787849, u'shaymitch'), (0.074905125344053899, u'effort'), (0.069146398517420757, u'center'), (0.057432361295920401, u'transit'), (0.031763575769642956, u'shirt'), (0.021827745477087143, u'annual'), (0.021187083777231015, u'discussion'), (0.020242426476561468, u'alaskamuseum')]
Topic 3 :
    score: -923.221286919
    terms: [(0.85980734533897762, u'time'), (0.053632624969417904, u'price'), (0.011527374201888069, u'contest'), (0.010448186944942218, u'code'), (0.0086521853468425145, u've'), (0.00818761702404135, u'emmaawoodman'), (0.0068903061719785405, u'lb'), (0.006275031743184163, u'thousand'), (0.0062453120783072134, u'cyber'), (0.0050501545928709262, u'disability')]
Topic 4 :
    score: -941.089086337
    terms: [(0.31143463910498809, u'courtesy'), (0.12742628931051905, u'http\u2026'), (0.05564130353977164, u'state'), (0.041971134938419224, u'seatacairport'), (0.04124769969343426, u'datum'), (0.038600851831483235, u'laptoptravel'), (0.034082365045011033, u'war'), (0.02860317640978579, u'disappointment'), (0.024126008817158894, u'standing'), (0.021220078879878243, u'struggle')]
Topic 5 :
    score: -944.262582078
    terms: [(0.33084551505851567, u'lot'), (0.17333701368291657, u'lax'), (0.13013157088470853, u'fence'), (0.040436542577375327, u'pair'), (0.037121219499245794, u'logo'), (0.031175408454158234, u'pat'), (0.0300717933133881, u'shoe'), (0.028152106003590704, u'organization'), (0.025940549021496472, u'precheck'), (0.022732107722102868, u'south')]
Topic 6 :
    score: -948.49993652
    terms: [(0.25748813398920706, u'avgeek'), (0.13935289532899334, u'lack'), (0.10320104854373992, u'size'), (0.060321194043677626, u'deltaairline'), (0.045195788726625656, u'respect'), (0.033175686971354035, u'notification'), (0.030820613407433094, u'camera'), (0.030538562347632216, u'loner'), (0.025561938972041159, u'deltaone'), (0.021350307862935378, u'dfwtower')]
Topic 7 :
    score: -952.242228029
    terms: [(0.12159141346890792, u'americanairline'), (0.10540495320151286, u'star'), (0.081421062217740897, u'value'), (0.051675154207232489, u'reminder'), (0.04704850662086861, u'petition'), (0.042816968156695459, u'p\u2026'), (0.042601386720823656, u'final'), (0.039937805386733645, u'ceo'), (0.039216753402485996, u'galaxy'), (0.036844032730668851, u'b\u2026')]
Topic 8 :
    score: -963.976799723
    terms: [(0.78672113556977663, u'thing'), (0.050194933945640972, u'airplane'), (0.033046603890821175, u'coffee'), (0.028355946427932974, u'flt'), (0.024932752294522128, u'aviation'), (0.0081349963439381785, u'magic'), (0.007227684092340733, u'operator'), (0.0061031543181435275, u'alley'), (0.0053328804163362536, u'demand'), (0.0049251209578879945, u'toilet')]
Topic 9 :
    score: -970.931296169
    terms: [(0.36216375302602749, u'tv'), (0.21431340125696863, u'club'), (0.089338372984146514, u'message'), (0.043936133740791743, u'quality'), (0.037875789114424387, u'hell'), (0.032358262800866809, u'text'), (0.02911677096727134, u'mood'), (0.020848186803565132, u'analytic'), (0.020062690303510194, u'upset'), (0.019722198840548981, u'pls')]
Topic 10 :
    score: -972.461343972
    terms: [(0.14295860674466301, u'difference'), (0.11813709981594646, u'racist'), (0.10875670180833247, u'soldier'), (0.088398723282354597, u'thepointsguy'), (0.078882400567733268, u'prop'), (0.053458450081779824, u'paper'), (0.047524462891480254, u'peanut'), (0.043736957863822541, u'chip'), (0.039659598547154423, u'lawsuit'), (0.033123336598126968, u'sex')]
Topic 11 :
    score: -976.187670365
    terms: [(0.94490042932225771, u'flight'), (0.011492676883819041, u'hopper'), (0.0055779015065598745, u'attention'), (0.0043373346672962755, u'gold'), (0.0039147084896265808, u'netflix'), (0.0034435730267496968, u'cookies'), (0.0022717782060709328, u'cent'), (0.002108150509181869, u'anxiety'), (0.0020369675340306246, u'sprint'), (0.0019146453929025325, u'wallet')]
Topic 12 :
    score: -978.594052186
    terms: [(0.22568739591755585, u'athlete'), (0.14370266124201583, u'partner'), (0.095278527039923899, u'nonstop'), (0.068427293351423446, u'highlight'), (0.060495661445091491, u'failing'), (0.044761370608579341, u'steel'), (0.034827501068038905, u'icon'), (0.034444024053367425, u'comfort'), (0.02879458117171968, u'fog'), (0.024605422021900691, u'travels')]
Topic 13 :
    score: -980.532081183
    terms: [(0.50740574727884336, u'ticket'), (0.12358011936014265, u'website'), (0.090342997788173743, u'yr'), (0.036985924586155544, u'type'), (0.03094365909146965, u'head'), (0.03063753872218155, u'sunrise'), (0.030155334493417691, u'mountain'), (0.026074485492230196, u'concern'), (0.012119822614656385, u'booze'), (0.011802330555140123, u'breakfast')]
Topic 14 :
    score: -985.347111783
    terms: [(0.2074228862963004, u'airfarewatchdog'), (0.19050638577022611, u'weather'), (0.112008193850806, u'read'), (0.10531176001845891, u'trophy'), (0.051246317596785246, u'talk'), (0.036307393721145872, u'treat'), (0.035307884666738988, u'champion'), (0.031770575180548127, u'progress'), (0.022056620630004971, u'robot'), (0.022044258769509075, u'vintage')]

Even after sorting by coherence score, many of these topics do not obviously correspond to a particular human-identifiable concept. Because tweets are so short, LDA has trouble coming up with enough information to assign terms to different topics in a way consistent with human understanding. However, we do have Topic 0, which includes "check-in", "counter", "cart", and "efficiency" among its highest-probability terms, and Topic 13, which includes "ticket" and "website". These give some indication of possible topics we might be interested in.

These coherence-score ranks aren't the internal topic IDs that gensim uses, though, so we have to look up the internal ID for each topic of interest.

In [47]:
all_topics = lda.show_topics(NUM_TOPICS, formatted=False)
In [48]:
def get_topic_id(topic, all_topics):
    """
    Matches the topics returned by top_topics to their ids in the LDA model
    by checking for term overlap.
    """
    for tpc in all_topics:
        if len(set([t[0] for t in tpc[1]]) & set([t[1] for t in topic[0]])) == 10:
            return tpc[0]
In [62]:
best_topics_ids = [get_topic_id(topic, all_topics) for topic in best_topics]
for idx, tpc in enumerate(best_topics_ids):
    print "Topic", idx, "above has ID", tpc
Topic 0 above has ID 78
Topic 1 above has ID 66
Topic 2 above has ID 37
Topic 3 above has ID 76
Topic 4 above has ID 75
Topic 5 above has ID 48
Topic 6 above has ID 62
Topic 7 above has ID 82
Topic 8 above has ID 38
Topic 9 above has ID 64
Topic 10 above has ID 54
Topic 11 above has ID 9
Topic 12 above has ID 24
Topic 13 above has ID 61
Topic 14 above has ID 34

We can use these IDs to figure out which of the above topics each tweet is most associated with.

Now that we have a bunch of topics and we've assigned each tweet a probability of being "happy" or "sad" (i.e., expressing positive or negative sentiment), we can calculate what topics are most associated with positive or negative tweets.

We start by determining how likely each tweet is to be associated with each of our best topics.

In [50]:
def get_best_topics(bow):
    my_topics = lda.get_document_topics(bow)
    my_best_topics = [tpc for tpc in my_topics if tpc[0] in best_topics_ids]
    return my_best_topics

We only keep tweets that have a non-negligible likelihood of belonging to one of our top topics (using the filter statement below).

In [51]:
tweets_topics = doc_vecs.mapValues(get_best_topics).filter(lambda (k,v): v)
tweets_topics.take(5)
Out[51]:
[((4, 'delta'), [(9, 0.33666666666666867)]),
 ((11, 'AmericanAir'), [(9, 0.33665415693021578)]),
 ((21, 'delta'), [(9, 0.33635326484076972)]),
 ((22, 'delta'), [(9, 0.33633022258042972)]),
 ((25, 'delta'), [(9, 0.33563526627386786)])]

We can pull out the correlation scores of each tweet to each topic and then plot a histogram for each topic showing its distribution of correlation scores.

In [52]:
# Get a list of all the tweets' correlation scores to each of our topics
scores_per_topic = tweets_topics.values().flatMap(lambda v: v).mapValues(lambda v: [v]).reduceByKey(add)
# Store it as a dict, keyed by the topic ID
hist_input = scores_per_topic.collectAsMap()
In [53]:
fig, axes = plt.subplots(5,3, sharex=True)
fig.set_size_inches(16, 12)
for idx, ax in enumerate(axes.ravel()):
    ax.hist(hist_input[best_topics_ids[idx]], 20, range=[0,1])
    ax.set_title('Topic {} (id {}): {} total tweets'.format(idx, best_topics_ids[idx], len(hist_input[best_topics_ids[idx]])))

For most topics, there are many tweets that are a little bit related to the topic, and fewer that the model says are very closely related to the topic. There also appears to be a disproportionate concentration of correlation scores in the 0.5 bin, particularly compared to the bins immediately surrounding it. This is likely due to a mathematical quirk of LDA, which is particularly visible because tweets are very short documents.

In general, the LDA topic model is not hugely confident in its assignment of individual tweets to topics. This is not surprising, because an individual tweet contains very few words (and even fewer nouns recognizable by the part-of-speech tagger, particularly given that tweets do not always follow standard grammatical structure), so there is not very much basis on which to assign a tweet to one topic or another. One way to improve this would be to use a part-of-speech tagger trained on a Twitter corpus, so that more information could be extracted from each tweet. However, 140 characters is never going to contain a huge number of words for the LDA model to base its topic assignment on, so automatic topic assignments of tweets are inherently going to be less certain than automatic topic assignments of longer documents. Even though our topic results may be somewhat difficult to interpret, they still illustrate how LDA can be used to break down a complex corpus into features, which can then be used to break down sentiment analysis by topic.

Next, we combine our topic RDD with our sentiment RDD to convert the topics and sentiment scores associated with each tweet into sentiment scores associated with each topic, and with each topic-airline combination. We will weight a tweet's contribution to each topic's sentiment score by its correlation score to that topic.

In [54]:
# Join topics and sentiment probabilities by key
combined = tweets_topics.join(tweets_probs)
combined.take(5)
Out[54]:
[((36603, 'southwest'),
  ([(61, 0.25250000000000056)], (0.76802095969987361, 0.23197904030012639))),
 ((61427, 'delta'),
  ([(9, 0.25207955401400817)], (0.99993794826303162, 6.205173696838262e-05))),
 ((78144, 'jetblue'),
  ([(61, 0.33666666666666734)], (0.99612894641769567, 0.0038710535823043291))),
 ((43903, 'united'),
  ([(24, 0.50250000000000195)], (0.95717728750125242, 0.042822712498747584))),
 ((77739, 'flyfrontier'),
  ([(61, 0.12625000000000031)], (0.99748970111432333, 0.002510298885676665)))]

We then rearrange the data to make it possible to average sentiment contributions across all tweets associated with a given topic.

In [55]:
def regroup_by_topic(topics_probs, happy_prob, sad_prob):
    topic_sentiments = []
    for topic, prob in topics_probs:
        topic_sentiments.append((topic, (prob, happy_prob, sad_prob)))
    return topic_sentiments
In [56]:
regrouped = combined.mapValues(lambda (tps, (hp, sp)): regroup_by_topic(tps, hp, sp)).flatMapValues(lambda v: v)
regrouped.take(5)
Out[56]:
[((36603, 'southwest'),
  (61, (0.25250000000000056, 0.76802095969987361, 0.23197904030012639))),
 ((61427, 'delta'),
  (9, (0.25207955401400817, 0.99993794826303162, 6.205173696838262e-05))),
 ((78144, 'jetblue'),
  (61, (0.33666666666666734, 0.99612894641769567, 0.0038710535823043291))),
 ((43903, 'united'),
  (24, (0.50250000000000195, 0.95717728750125242, 0.042822712498747584))),
 ((77739, 'flyfrontier'),
  (61, (0.12625000000000031, 0.99748970111432333, 0.002510298885676665)))]

Now we combine the probabilities per topic to find out what airline-related topics people on Twitter are most positive or negative about.

In [57]:
weighted_avg_probs = lambda (p1, hp1, sp1), (p2, hp2, sp2): (p1 + p2, (p1*hp1 + p2*hp2)/(p1 + p2), (p1*sp1 + p2*sp2)/(p1 + p2))
happy_sad_probs = lambda (p, hp, sp): (hp, sp)
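The Python 2 tuple-unpacking lambdas above implement a correlation-weighted running average; in Python 3 syntax the same combine step would look like this sketch (made-up numbers):

```python
def weighted_avg_probs(a, b):
    """Merge two (weight, happy, sad) triples, weighting by topic correlation."""
    p1, hp1, sp1 = a
    p2, hp2, sp2 = b
    p = p1 + p2
    return (p, (p1 * hp1 + p2 * hp2) / p, (p1 * sp1 + p2 * sp2) / p)

w, hp, sp = weighted_avg_probs((0.25, 0.8, 0.2), (0.75, 0.4, 0.6))
# hp = (0.25*0.8 + 0.75*0.4) / 1.0 = 0.5
```

A tweet strongly correlated with a topic thus moves that topic's sentiment more than a weakly correlated one.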
In [58]:
topics_sentiments = regrouped.values().reduceByKey(weighted_avg_probs).mapValues(happy_sad_probs)
print topics_sentiments.take(5)
topic_sent_dict = topics_sentiments.collectAsMap()
[(64, (0.86043113418013251, 0.13956886581986744)), (48, (0.88799282017470205, 0.11200717982529795)), (34, (0.90333497804761853, 0.096665021952381483)), (82, (0.85483282528218318, 0.14516717471781673)), (66, (0.91759465815347063, 0.082405341846529173))]
In [59]:
index = np.arange(len(best_topics_ids))
happy_probs_by_topic = np.array([topic_sent_dict[idx][0] for idx in best_topics_ids])
order = np.argsort(happy_probs_by_topic)[::-1]
width = 0.5
fig, ax = plt.subplots(figsize=(16,8))
plt.bar(index, happy_probs_by_topic[order], width, alpha=1, color='violet')
# Setting axis labels and ticks
ax.set_xlim((-width, len(best_topics_ids)))
ax.set_ylabel('Happy Probability', fontsize=15)
ax.set_xlabel('Topic ID', fontsize=15)
ax.set_title('Twitter Happiness by Topic', fontsize=15, fontweight='bold')
ax.set_xticks([i + 0.5 * width for i in index])
ax.set_xticklabels(np.array(best_topics_ids)[order])
plt.grid()
plt.show()

The airline-related topic that Twitter users are most positive about overall is the topic with ID 37, which is characterized by the terms "priority", "effort", and "transit". These tweets are likely about general experiences with the airline, particularly priority programs and airline workers.

The airline-related topic that Twitter users are least positive about overall is the topic with ID 62, which is characterized by the terms "lack", "size", and "respect", as well as picking up the specific airline-related terms of "deltaairline" and "deltaone". The low happiness associated with this category is likely a reflection of negative tweets across the board about concerns regarding insufficient space on planes and insufficient respect for passengers, but may also reflect pockets of frustration with Delta, despite the fact that Delta received a higher average happiness score overall than any other airline.

Another comparatively negative topic is ID 75, characterized by "courtesy", "standing", and "struggle". This category may be more difficult to interpret as a human-understandable concept due to its lower coherence score, but we could infer that tweets about encounters with airline personnel or other passengers might be found in this category.

We're less interested in what people think of different aspects of airlines overall than in how individual airlines score in each of these categories, though. Let's re-group our data by airline and topic and see how each airline stacks up on each topic.

Like above, we work with the joined RDD and extract weighted happy and sad probabilities, but this time we use airline, topic pairs as our keys rather than just topics alone. This breaks down how happy or sad Twitter users are about each topic specifically with regards to each airline.

In [60]:
airline_topics_sentiments = regrouped.map(lambda ((idx, air), (tpc, probs)): ((air, tpc), probs)).reduceByKey(weighted_avg_probs).mapValues(happy_sad_probs)
print airline_topics_sentiments.take(5)
air_tpc_sent_dict = airline_topics_sentiments.collectAsMap()
[(('virginamerica', 66), (0.97749367202012738, 0.022506327979872561)), (('delta', 75), (0.78633422427458299, 0.21366577572541687)), ((None, 78), (0.60213065902775109, 0.39786934097224902)), (('virginamerica', 82), (0.5221810415945578, 0.4778189584054422)), (('virginamerica', 34), (0.89405529379852799, 0.10594470620147198))]

We can graph the average happiness associated with tweets about each airline in each topic, again weighted by each tweet's degree of association with the topic in question. Every airline except Hawaiian Air has at least one tweet associated with each topic.

In [61]:
index = np.arange(len(airlines))
colors = ['#975A7A', '#78C0E0', '#FFB3AA', '#4F86C6', '#17A398', '#EE8434', '#FFD5FF', '#FFBC00', '#009FFD', '#8565A0']
fig, axes = plt.subplots(5, 3, figsize=(16,16), sharex=True, sharey=True)
for tpc, ax in zip(best_topics_ids, axes.ravel()):
    happy_probs_by_airline = [air_tpc_sent_dict.get((air, tpc), [0])[0] for air in airlines]
    width = 0.5
    barlist = ax.bar(index, happy_probs_by_airline, width, alpha=1)
    for bar, color in zip(barlist, colors):
        bar.set_color(color)
    # Setting axis labels and ticks
    ax.set_xlim((-width, len(airlines)))
    ax.set_ylabel('Happy Probability')
    ax.set_xlabel('Airline')
    ax.set_title('Topic ID {}'.format(tpc))
    ax.set_xticks([])
fig.legend(barlist, airlines, loc=(0.2,0.94), ncol=5)
plt.show()

The missing bars for Hawaiian Air in topics 78 and 37 indicate no data (not zero probability of happiness).

We can see that even per topic, most tweets are generally rated positive by our sentiment analysis. However, there are some topics where a few airlines stand out as substantially better or worse than others. In particular, Southwest ranks poorly on topics 78 ("snack", "check-in", "bonus", "cart", "counter") and 48 ("fence", "pat", "shoe", "pre-check") but is otherwise associated with as much or more happiness than other airlines. This might indicate that customers are unsatisfied with Southwest's check-in process and experiences going through security, but otherwise pretty happy with the airline.

Hawaiian Air ranks quite poorly on topic 54, which could be an indictment of their snacks since that topic contains "peanut" and "chip", but it could also be a legal problem or a problem with their treatment of military passengers since the topic also contains the terms "soldier" and "lawsuit". Again, the uncertainty in the topics generated by LDA on our dataset is a function of how short tweets are, meaning that they don't provide very much information about what terms are related to each other to form topics in the first place, and also don't provide much information about what topics they might belong to.

Overall, we found that even though our dataset of tweets is not optimal for generating highly informative LDA topics, breaking down our sentiment analysis by topic shows us underlying differences in sentiment between airlines in different aspects of their service. With a larger dataset, we might be able to offset the shortness of individual tweets to generate more useful topics. However, longer documents – e.g., Facebook posts about airlines – would be even more likely to reveal useful categories that people discuss with regards to airlines.

Future Work

If we had more time, we would have liked to try the following:

  1. Given the time constraint, our goal was to collect at least 100k tweets. We accomplished this, but would have liked a much larger dataset, which would have made the algorithms more robust and improved the topics derived from the LDA process.

  2. We would like to do more work on the Naive Bayes probabilities dataset to improve the accuracy of the sentiment analysis algorithm.

  3. We plan to create a word set of 'amplifiers' to exaggerate the positivity or negativity of tweets that include certain words. For example, the word 'incredible' could make a tweet more strongly positive or negative.

  4. We would like to work towards a highly scalable version of this workflow. This would be needed for larger datasets.

  5. We would like to be able to weigh tweets based on sentiment analysis of a user's past Twitter history. This would contribute to a better learning model.
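The 'amplifier' idea in item 3 could take a shape like the following sketch (the word weights and the scaling rule are purely hypothetical):

```python
AMPLIFIERS = {'incredible': 1.5, 'absolutely': 1.3}  # hypothetical weights

def amplify(prob, tokens):
    """Push a sentiment probability further from neutral (0.5) when amplifier words appear."""
    factor = max((AMPLIFIERS.get(t.lower(), 1.0) for t in tokens), default=1.0)
    return min(1.0, max(0.0, 0.5 + (prob - 0.5) * factor))

amplify(0.8, ['incredible', 'flight'])  # more positive: 0.95
amplify(0.3, ['incredible', 'delay'])   # more negative: 0.20
```

Because the scaling is symmetric around 0.5, the same amplifier word strengthens whichever sentiment the tweet already expresses, matching the intent described above.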